Abstract - Field Programmable Gate Arrays (FPGAs)
are increasingly used for highly parallel processing of
compute-intensive tasks. This paper introduces an FPGA
hardware platform architecture that is PC-based, allows for
fast reconfiguration over the PCI bus, and retains a simple
physical hardware design. The design considerations are
discussed first, followed by the resulting system
architecture. Finally, experimental results on the FPGA
resources utilized by this design are presented.

I. Introduction

Computer processors have for many years been designed
based on the von Neumann or Harvard architectures. Software
to be run on these processors is compiled into a set of
processor-specific instructions, which are loaded at run
time and executed sequentially. Such sequential processing
of an instruction every few clock cycles works well enough
for typical PC applications such as text editors, which
have low data processing requirements.

However, PCs are also often used for computationally
intensive, high-throughput data processing, especially in
scientific research work. The sequential nature of the
typical PC processor, such as the Intel Pentium, becomes a
major processing bottleneck in such situations. The usual
solutions have been to use processors with higher clock
speeds, or to network several PCs together into a cluster
or computational grid [1].

More recently, there has been an increasing interest in 
the use of reconfigurable hardware chips for such compute 
intensive data processing. These chips, such as Field Pro- 
grammable Gate Arrays (FPGAs), possess a fundamen- 
tally different architecture from the typical von-Neumann 
or Harvard type processors. The algorithms to be exe- 
cuted are normally defined in a hardware description lan- 
guage and compiled into a bitstream, which will be down- 
loaded to the FPGA as and when use of the algorithm is 
desired. This bitstream download will reconfigure the hard- 
ware logic on the FPGA accordingly, allowing data passed 
into the FPGA to be processed in hardware, in parallel. 

Several reconfigurable computing research projects [2] [3]
[4] focus on developing new, improved designs of
reconfigurable chips. Other groups [5] [6] [7] utilize
off-the-shelf FPGAs, such as those from Xilinx [8], and
work on issues such as logic placement and routing
optimization [9].

Project Proteus [10] was initiated by the DSP Technology
Centre of Ngee Ann Polytechnic to develop a low-cost
FPGA-based reconfigurable computing platform for typical
PCs, with a portable software platform layer and using
off-the-shelf hardware components. The hardware is designed
for fast FPGA reconfiguration, with minimal physical
component count and complexity, while retaining desirable
features of a reconfigurable platform such as configuration
bitstream readback and dynamic reconfiguration. This paper
discusses the requirements and design of this hardware
architecture.

Section II describes the requirements of the
reconfigurable computing platform hardware architecture and
how it compares to existing solutions, Section III
discusses the architectural issues addressed and the design
of the hardware, Section IV presents some experimental
results on the FPGA resource requirements, and finally
Section V concludes this paper.

II. Design Considerations 

To understand the architecture of the hardware, it will 
be useful to first discuss the requirements imposed by its 
intended use and desired features. 

Firstly, the reconfigurable computing platform is
intended to be PC-based. The FPGA will therefore be
accessed over a common PC bus such as PCI-33, PCI-X,
FireWire or USB. The selected bus has to be fast enough to
transfer both the data to be processed and the FPGA
reconfiguration bitstream. Since there is such a wide
variety of PC bus interface standards, it is also desirable
for the PC interface core to be swappable to other
standards, depending on what is available on the PC side.

Secondly, a major goal of this work has been to develop
a system that allows for fast reconfiguration of the FPGA
by the PC. The Proteus Software Platform running on the PC
acts as a supervisor, downloading algorithms in the form of
reconfiguration bitstreams to the FPGA as and when desired.
Speed matters because, in a processing chain of algorithms,
data may first be processed by one algorithm on the FPGA
and the results returned to the PC, after which the FPGA
has to be reconfigured with the next algorithm in the chain
before processing of that data can continue. Fast
reconfiguration minimises the delay introduced by this
step.

Thirdly, it is desirable to keep the physical hardware
design simple, with as low a component count as possible.
This contributes towards the goal of keeping the
development costs of this platform low. One way to do this
is to run both the PC interface core and the algorithm
implementation on the same FPGA. However, this brings the
additional requirement that the part of the FPGA holding
the algorithm should be reconfigurable dynamically and
independently of the PC interface, even with both on the
same chip. To keep the hardware design simple, each
hardware board also need not have multiple FPGAs, because
the software platform allows the concurrent use of FPGAs on
multiple boards.

Other reconfigurable platform architectures have been
developed, but none satisfies all the requirements given in
this section. For example, Blodget et al. have designed a
self-reconfiguring platform for embedded systems which
utilises a soft processor core within the FPGA instead of
an external PC. Fong et al. [13] have developed a system
that uses the RS-232 port to transfer configuration data to
the FPGA, the low transfer speed of which is a limiting
factor in reconfiguration performance.

III. System Architecture 

Considering the requirements set out in Section II, the
hardware architecture was designed to utilise the PCI bus
and the self-reconfiguration capability of the Xilinx
Virtex II FPGA.

The PCI bus is commonly found in typical PCs, and allows
the hardware to be fully powered from the bus itself. This
removes the need for additional power-supply circuits, and
avoids electric current separation issues in PCI
signalling. In addition, the high throughput of the bus
(132 MByte/sec for 33 MHz 32-bit PCI) allows for fast
transfer of the reconfiguration bitstream from the PC to
the FPGA. The maximum rate at which a bitstream can be
transferred over the SelectMap bus to reconfigure a Xilinx
Virtex II FPGA is 50 MByte/sec, so the PCI bus can
comfortably keep the SelectMap bus operating at its maximum
speed.
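
The throughput argument above can be checked with a short
calculation. The sketch below uses only the two peak rates
quoted in the text; the 1 MByte bitstream size is a made-up
example value, and real PCI transfers will fall short of the
peak rate because of arbitration and protocol overhead.

```python
# Back-of-envelope reconfiguration time; the bitstream size is a made-up example.
PCI_BYTES_PER_SEC = 33_000_000 * 4        # 33 MHz x 32-bit bus = 132 MByte/s peak
SELECTMAP_BYTES_PER_SEC = 50_000_000      # 50 MByte/s maximum quoted in the text


def reconfig_time_seconds(bitstream_bytes: int) -> float:
    """Lower bound on transfer time: the slower of the two links dominates."""
    bottleneck = min(PCI_BYTES_PER_SEC, SELECTMAP_BYTES_PER_SEC)
    return bitstream_bytes / bottleneck


if __name__ == "__main__":
    example_bytes = 1_000_000             # hypothetical 1 MByte partial bitstream
    print(f"{reconfig_time_seconds(example_bytes) * 1e3:.1f} ms")
```

Under these assumptions the SelectMap side, not the PCI bus,
sets the lower bound on reconfiguration time.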

Using the self-reconfiguration feature of the FPGA allows
one part of the FPGA to hold the PCI interface, while the
other holds the downloaded algorithm. This allows the
hardware board to remain very simple, removing the need for
a separate PCI interface chip. This approach also reduces
the number of possible points of system failure caused by
physical connections. At the same time, the Xilinx modular
design flow allows the PCI interface and algorithm parts to
be developed and tested independently, even though both are
on the same FPGA.

The combination of these features satisfies all the three 
requirements of using a common PC bus standard, fast 
FPGA reconfiguration timings, and a simple physical hard- 
ware board. The details of the system architecture are ex- 
plained below. 

A. Physical Hardware Design and Initial Bootup Configuration

As described in the third design consideration of Section
II, it is desirable for the physical hardware to be kept as
simple as possible. The hardware board was therefore
designed with only two main components - the FPGA and a
Flash RAM chip, as shown in Figure 1.

Fig. 1
Hardware Design

The FPGA is placed as close as possible to the physical 
PCI interface, so that it will satisfy the PCI specification 
requirements for signal line lengths. 

The Flash memory chip holds the FPGA's initial con- 
figuration bitstream, which is downloaded to the FPGA 
through the SelectMap interface upon power-up. This is 
necessary because SRAM-based FPGAs are volatile and 
lose their configuration when power is removed. This ini- 
tial configuration bitstream holds the logic for the PCI in- 
terface core and the self-reconfiguration controller. 

Once the initial configuration is done, the host PC may 
download partial bitstreams over the PCI bus to reconfig- 
ure the FPGA according to the desired algorithm. 

In designing the physical hardware, it was decided that 
only a single FPGA would be included on each board. 
Other hardware designs exist that use arrays of FPGAs, 
but these introduce additional design issues in interchip 
connectivity, communication protocol specification, and 
task distribution. With the increase in the density and 
available logic area of FPGAs in recent years, complex al- 
gorithms can now be easily contained in a single FPGA. 

A single-chip solution benefits algorithm designers, who
need not be concerned with the complexity of breaking a
design down into parts for each FPGA, or with constraining
communications between these parts to the interchip signal
lines that have been routed. If more than one FPGA is
needed, the Proteus Software Platform can utilize multiple
boards concurrently.

B. FPGA Partitioning for Single-Chip Architecture 

The use of a single FPGA to hold both the PCI core and 
the desired algorithm involves partitioning the chip into 
two modules - a 'fixed' part and a 'reconfigurable' part. 
The 'fixed' part holds the PCI core and configuration con- 
troller, and is defined by the initial full bitstream down- 
loaded from the Flash upon bootup. The 'reconfigurable' 
part is dynamically configured according to the desired al- 
gorithm, via a corresponding partial bitstream download 
from the PC. This effectively uses the FPGA as a partially
self-reconfigurable system, with the configuration
controller in the fixed part internally performing this
dynamic reconfiguration operation on the 'reconfigurable'
part. The two parts are linked via Bus-Macros [15], which
provide a well-defined physical interface of signal lines.
Figure 2 shows the partitioning of the FPGA for these four
main components.

Fig. 2
Partitioning of the FPGA



The Xilinx Virtex 2 XC2V3000 was used for the proto- 
type developed. The architecture of the Virtex 2 family re- 
quires module areas to span the full height of the chip. Re- 
configuration granularity is thus restricted to full columns, 
and modules correspondingly have only a 1-dimensional 
flexibility of area placement. 

The 'fixed' part is constrained to occupy the right-most 
columns because the Internal Configuration Access Port 
(ICAP) in a Virtex 2 FPGA is located at the lower-right 
corner. The ICAP is used by the Configuration Control 
logic to reconfigure the 'reconfigurable' part of the chip. 
The I/O pin pads used by the PCI core are accordingly
lined up along the right side border of the FPGA.
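
The column constraints described above can be expressed as a
small validity check. This is only an illustrative sketch:
the `Module` type, the `validate_floorplan` helper, the
column count of 56, and the example column boundaries are all
assumptions made here, not values taken from the prototype.

```python
from dataclasses import dataclass

TOTAL_COLUMNS = 56  # assumed CLB column count; adjust for the actual device


@dataclass(frozen=True)
class Module:
    name: str
    first_col: int  # inclusive, 0 = left-most column
    last_col: int   # inclusive

    def columns(self) -> range:
        return range(self.first_col, self.last_col + 1)


def validate_floorplan(fixed: Module, reconfig: Module) -> None:
    """Raise ValueError if the two full-height column ranges are not a legal split."""
    for m in (fixed, reconfig):
        if not (0 <= m.first_col <= m.last_col < TOTAL_COLUMNS):
            raise ValueError(f"{m.name}: columns out of range")
    if set(fixed.columns()) & set(reconfig.columns()):
        raise ValueError("modules overlap")
    if fixed.last_col != TOTAL_COLUMNS - 1:
        raise ValueError("fixed part must include the right-most column (ICAP side)")


if __name__ == "__main__":
    # Hypothetical split: reconfigurable part on the left, fixed part on the right.
    validate_floorplan(Module("fixed", 48, 55), Module("reconfig", 0, 47))
    print("floorplan accepted")
```

Because modules span the full chip height, each one is fully
described by a column range, which is why a one-dimensional
check is enough.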

C. Modular Design Flow 

In implementing algorithms for the reconfigurable part, 
it will be undesirable and inconvenient if the entire design 
(including the fixed part) has to be compiled and tested as 
a whole each time. The Xilinx Modular Design Flow al- 
lows each module (the 'fixed' part and the 'reconfigurable' 
part) to be developed, tested, and debugged independently. 
Only the reserved module area and the Bus-Macro interface 
have to remain consistent, and are defined in the common 
'top-level' information used in individual module designs. 
The module area constraints are used in the place-and-route
(par) step to define the module boundaries, outside of
which logic placement and signal routing are disallowed.

From the individual module design, a 'partial bitstream' 
can be created that represents the configuration informa- 
tion for a single module. This 'partial bitstream' created 
for the 'reconfigurable' part is dynamically downloaded 
from the PC to reconfigure the corresponding portion on 
the FPGA, according to the desired algorithm. 

The 'full bitstream' derived from the design of the 'top- 
level' and 'fixed' part is stored on the Flash for the initial 
boot-up configuration of the FPGA. 

Although Xilinx has stated [11] that it is not possible to
use the PCI core in the Modular Design Flow, this work has
shown that it is actually possible to do so.

D. Self-Reconfiguration using the Internal Configuration
Access Port (ICAP)

The Configuration Controller of the 'fixed' part uses the
Internal Configuration Access Port (ICAP) of the Xilinx
Virtex 2 FPGA to perform self-reconfiguration. This process
is carried out without involving the host PC's CPU.

The Configuration Controller obtains the reconfiguration
data from a memory location shared with the PC. This avoids
having to restrict the configuration bitstream size to the
available on-chip FPGA RAM. In other self-reconfigurable
systems [12] [13], a limited amount of on-chip memory is
used to store the reconfiguration data. This makes the
self-reconfiguration process slow, because portions of the
configuration bitstream have to be loaded iteratively and
the FPGA reconfigured incrementally.

A second limitation is the throughput of the link over
which the reconfiguration bitstream is sent. This often
becomes the bottleneck in reconfiguration speed if a
low-throughput connection is used, such as in [13], where
an RS-232 serial line is used.

In this work, the bus-master capability of the PCI core 
allows for direct use of the PC's memory, which is large 
enough to hold an entire configuration bitstream. The data 
is transferred in a continuous stream over the PCI bus to 
the ICAP, removing the need for incremental reconfigura- 
tion and offering the fastest possible speed of reconfigura- 
tion over the high-throughput PCI bus. 

The ICAP uses the SelectMap bus protocol, so the
Configuration Controller has to act as a bridge between
that and the PCI bus. To provide for clock independence
between these parts, on-chip dual-port Block RAMs (BRAMs)
are used as a buffer. This technique avoids the need to
exchange ownership of a shared RAM space between the PCI
Interface and the SelectMap Interface, as done in our
previous work [16], thereby minimizing the latency in
accessing the ICAP. The prototype implementation uses two
BRAMs (on a Virtex 2 XC2V3000) for a total buffer size of
256 x 32 bits.

Internally, the Configuration Controller works with
32-bit items to match the width of the PCI bus. A
multiplexer segments each of these 32-bit items into four
single-byte items, which are passed on to the 8-bit wide
SelectMap bus. If the PCI bus fails to deliver sufficient
data to the buffer in time, the controller 'pauses' the
operation by stopping the configuration clock. The
operation resumes once data is available again.
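
The word-to-byte segmentation and the pause behaviour can be
modelled in a few lines of software. The sketch below is a
simplified behavioural model, not the hardware design: it
uses a single time base instead of two independent clocks,
an unbounded buffer instead of the BRAM buffer, and it
assumes most-significant-byte-first ordering, which the text
does not specify. The names `split_word` and
`selectmap_cycles` are hypothetical.

```python
from collections import deque
from typing import Deque, Iterable, Iterator, List, Optional


def split_word(word: int) -> List[int]:
    """Split one 32-bit word into four bytes, most significant byte first (assumed order)."""
    return [(word >> shift) & 0xFF for shift in (24, 16, 8, 0)]


def selectmap_cycles(arrivals: Iterable[Optional[int]]) -> Iterator[Optional[int]]:
    """Yield one item per configuration-clock slot: a byte, or None when stalled.

    `arrivals` models the PCI side: each element is a 32-bit word that
    arrived in that slot, or None if the bus delivered nothing.
    """
    buffer: Deque[int] = deque()
    for word in arrivals:
        if word is not None:
            buffer.extend(split_word(word))
        # With an empty buffer the configuration clock is held (stall).
        yield buffer.popleft() if buffer else None
    while buffer:             # drain what is left once arrivals stop
        yield buffer.popleft()


if __name__ == "__main__":
    pci_side = [0x11223344, None, None, None, None, None, 0xAABBCCDD]
    for slot, item in enumerate(selectmap_cycles(pci_side)):
        print(slot, "stall" if item is None else f"0x{item:02X}")
```

In this model a stall simply produces no byte for that slot,
which mirrors stopping the configuration clock until the
buffer holds data again.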

E. Fixed Part Design 

The PCI Interface is used for two purposes: to transfer 
partial reconfiguration bitstreams to the FPGA, and to ex- 
change data with the algorithm downloaded to the FPGA. 
Both of these transfers can use the bus-master feature of 
the PCI bus, which allows direct shared access to the PC's 
RAM to obtain data. This ensures that the design on the
FPGA is not limited by the small amount of on-chip RAM,
and provides for the fastest possible direct data transfer
from the PC.

The design of the fixed part which allows for this is shown
in Figure 3.

Fig. 3
Fixed Part Design

The major components of the fixed part include the
Fixed Part Control Section, Reconfig Part Control Section,
Common Part Control Section, Stream Data Section, and
Select Map Data Section.

The fixed part contains the control logic necessary for
bus-master mode access of the PCI bus. It also handles all
communication with the device driver, and transports data
to and from the reconfigurable part. The fixed part acts as
a bridge between the packet-oriented transfer of the shared
PCI bus and the continuous data stream transfer with the
reconfigurable part.

The fixed part is split into three control sections - the
Common Part Control, Reconfig Part Control, and Fixed Part
Control - and two data sections - Stream Data and Select
Map Data. The Common Part Control holds PCI access specific
functions and contains the interrupt generation logic. The
Reconfig Part Control allows the reconfigurable part to
easily implement static registers that can be individually
addressed, read from, and written to. Using these
registers, the reconfigurable part may obtain setup values
from the PC side and report back status information. The
reconfigurable part may also invoke interrupts via the
fixed part design. The Fixed Part Control section controls
the data transfer tasks between the PCI bus and the
reconfigurable part. The Data sections each hold Dual-Port
Block RAMs (BRAMs) to buffer data between the PCI bus and
the reconfigurable part. The Dual-Port feature allows the
BRAMs to be accessed from different clock domains on each
side. One port is accessed from the PCI side and driven
with the PCI clock, while the other port is accessed from
the reconfigurable part and driven with that part's clock.
A buffer size of 256 x 32 bits for each stream port has
been found sufficient to absorb the access latency of the
PCI bus.

The functional blocks of the Fixed Part Control Section
are shown in Figure 4.

Fig. 4
Fixed Part Control Section

The Fixed Part Control contains a Busmaster Initiator
and a Busmaster Address Provider. The Busmaster Initiator
starts bus-master data transfers with the allocated RAM on
the PC, over the PCI bus. It does so at each transfer
event; these events are triggered by the fill status of the
internal BRAM buffers. The Busmaster Address Provider keeps
track of the PCI addressing, and is needed because a
transfer on the shared PCI bus can be interrupted at any
time. When such an interruption happens, the transfer has
to be restarted at the next address once the bus is
available again.
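
The address bookkeeping can be illustrated with a short
software model. This is a sketch under stated assumptions,
not the hardware logic: the function name `plan_bursts` and
the way interruptions are supplied (as a list of word counts
completed before each cut-off) are invented for the example,
and addresses are byte addresses advancing by 4 per 32-bit
word.

```python
from typing import Iterable, List, Tuple


def plan_bursts(start_addr: int, total_words: int,
                words_before_interrupt: Iterable[int]) -> List[Tuple[int, int]]:
    """Return the (address, word_count) bursts needed to move `total_words`.

    Each value in `words_before_interrupt` is how many 32-bit words the bus
    lets through before the current burst is cut off; once the values run
    out, the remaining words go through in one final burst.
    """
    bursts: List[Tuple[int, int]] = []
    addr, remaining = start_addr, total_words
    limits = iter(words_before_interrupt)
    while remaining > 0:
        done = min(remaining, next(limits, remaining))
        if done <= 0:
            continue          # interrupted before any word moved; retry
        bursts.append((addr, done))
        addr += 4 * done      # next address = byte address after the last word
        remaining -= done
    return bursts


if __name__ == "__main__":
    # Hypothetical case: 10 words from 0x1000, cut off after 4 and then 3 words.
    for addr, count in plan_bursts(0x1000, 10, [4, 3]):
        print(f"burst at 0x{addr:04X}: {count} words")
```

Each restarted burst begins at the address immediately after
the last word that was transferred, so no word is sent twice
or skipped.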

The Transfer Arbitration block schedules the utilization
of the PCI Interface by the four possible stream targets -
Upstream, Downstream, Select Map Read, and Select Map
Write. All four targets share the same PCI Interface, so
the Transfer Arbitration block uses a simple scheduling
algorithm to serve data transfer requests evenly: a target
is not allowed to use the PCI Interface twice in a row if
requests from other targets are pending. Each stream target
has a corresponding Control section, which generates data
transfer requests and monitors the status of the BRAM
buffers in the Data sections.
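
One scheduling policy that satisfies the rule above is a
rotating-priority (round-robin) arbiter. The text does not
say which algorithm the Transfer Arbitration block actually
uses, so the sketch below is an assumption chosen for
illustration; the class name `TransferArbiter` and the
target identifiers are hypothetical.

```python
from typing import Iterable, Optional

TARGETS = ("Upstream", "Downstream", "SelectMapRead", "SelectMapWrite")


class TransferArbiter:
    """Rotating-priority arbiter: the last winner is considered last next time."""

    def __init__(self) -> None:
        self._last = len(TARGETS) - 1   # so "Upstream" is first in line initially

    def grant(self, pending: Iterable[str]) -> Optional[str]:
        """Return the target that gets the PCI Interface, or None if idle."""
        requests = set(pending)
        for step in range(1, len(TARGETS) + 1):
            idx = (self._last + step) % len(TARGETS)
            if TARGETS[idx] in requests:
                self._last = idx
                return TARGETS[idx]
        return None


if __name__ == "__main__":
    arbiter = TransferArbiter()
    for _ in range(4):
        print(arbiter.grant(["Upstream", "Downstream"]))
```

With two targets requesting continuously, the grants
alternate between them; a target is served twice in a row
only when it is the sole requester.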

IV. Experimental Results 

The fixed part design as described in the previous section
was compiled for a Xilinx XC2V3000FF1152-4 FPGA using
Xilinx ISE 6.2. Figure 5 shows a screen capture of the ISE
FloorPlanner with the regions of the FPGA allocated for the
fixed part in yellow, and those for the reconfigurable part
in blue. Figure 6 is a screen capture from the ISE FPGA
Editor, showing the resources used by the fixed part. A
total of 2211 slices (out of 14336) and 3 BRAMs are used,
taking up about 15% of the XC2V3000.
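
The slice utilization figure follows directly from the two
slice counts reported above, as the short calculation below
shows. Only the slice counts from the text are used; the
variable names are chosen here for illustration.

```python
SLICES_USED = 2211      # slices occupied by the fixed part (from the text)
SLICES_TOTAL = 14336    # total slices on the device (from the text)

utilization = SLICES_USED / SLICES_TOTAL

if __name__ == "__main__":
    print(f"fixed part: {utilization:.1%} of slices")
    print(f"remaining:  {SLICES_TOTAL - SLICES_USED} slices")
```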

Fig. 5
Regions of the FPGA allocated for the fixed part (yellow)
and the reconfigurable part (blue)

Fig. 6
Resources used by the Fixed Part

V. Conclusion

This paper has presented a reconfigurable computing
platform hardware architecture that satisfies the design
requirements of being PC-based, allowing for fast
reconfiguration over the PCI bus, and keeping the physical
hardware design simple.